- ID : Customer ID
- Age : Customer's age in completed years.
- Experience : #years of professional experience.
- Income : Annual income of the customer.
- ZIP Code : Home Address ZIP code.
- Family : Family size of the customer.
- CCAvg : Avg. spending on credit cards per month
- Education : Education Level.
- Mortgage : Value of house mortgage.
- Personal Loan : Customer who accepted the personal loan offered in the last campaign.
- Securities Account : Customer having a securities account with the bank.
- CD Account : Customer having a certificate of deposit (CD) account with the bank.
- Online : Customer using internet banking facilities.
- Credit card : Customer using a credit card issued by Thera Bank.
import pandas as pd
import pandas_profiling
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import zscore
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import average_precision_score, confusion_matrix, accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
# reading csv file
bp = pd.read_csv("Bank.csv")
bp
bp.profile_report()
bp.info()
The data contains 5000 rows and 14 columns with no null values; info() also shows the data type of each column.
- The minimum.
- Q1 (the first quartile, or the 25% mark).
- The median (50%).
- Q3 (the third quartile, or the 75% mark).
- The maximum.
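The same five-number summary that describe() reports can be reproduced directly with NumPy. A small sketch on hypothetical toy data (not the bank dataset):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

summary = {
    "min": np.min(data),
    "Q1": np.percentile(data, 25),      # 25% mark
    "median": np.percentile(data, 50),  # 50% mark
    "Q3": np.percentile(data, 75),      # 75% mark
    "max": np.max(data),
}
print(summary)
```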
bp.describe()
Checking if negative values are present
bpt = bp
for x in bpt:
    bp[x] = bp[x].astype(float)
    att = bp[x]
    # count the negative values in this column
    a = att.lt(0).sum()
    if a != 0:
        print(x, 'has a negative-value count of', a)
# Checking the negative values of Experience through a graph
print('Distribution of the Experience column, including its negative values')
sns.distplot(bp['Experience'], kde=True)
Analyzing negative values
bpt = bp['Experience']
nv = []
for x in bpt:
    if x < 0:
        nv.append(x)
print('Experience has negative values with a total count of', len(nv))
def unique(list1):
    # initialize an empty list
    unique_list = []
    # traverse all elements of the argument (the original iterated the global nv)
    for x in list1:
        # append only if not already in unique_list
        if x not in unique_list:
            unique_list.append(x)
    # print the unique values
    for x in unique_list:
        print('\nUnique negative value :', x)
unique(nv)
print('\nReplacing negative values of Experience with 0')
for i in range(len(bpt)):
    if bpt[i] < 0:
        bpt[i] = 0
exp_median = bp['Experience'].median()
print('Replacing all 0 values of Experience with the median of the column.\nMedian of the column is:', exp_median)
bpt = bp['Experience']
for i in range(len(bpt)):
    if bpt[i] == 0:
        bpt[i] = exp_median
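The two element-wise loops above can be collapsed into a single vectorized pandas expression. A sketch on a hypothetical toy Series (not the real Experience column); `mask` replaces every non-positive value with the median of the positive ones in one step:

```python
import pandas as pd

exp = pd.Series([5, -1, 0, 12, -3, 8])

# median of the valid (positive) experience values
valid_median = exp[exp > 0].median()

# replace negatives and zeros with that median in one vectorized step
exp_clean = exp.mask(exp <= 0, valid_median)
print(exp_clean.tolist())
```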
# Checking the adjusted values of Experience through a graph
print('Adjusted distribution of Experience column')
sns.distplot(bp['Experience'], kde=True)
bp.nunique()
for x in bp:
    bp[x] = bp[x].astype(float)
    att = bp[x]
    mean = np.mean(att)
    std = np.std(att)
    outlier = []
    outliervalue = []
    for i in att:
        z = (i - mean) / std
        if z < -3.00 or z > 3.00:
            outlier.append(i)
            outliervalue.append(z)
    print('No. of outliers of', x, 'in the dataset is', len(outlier))
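The inner loop can also be written vectorized with the `zscore` helper already imported from SciPy. A sketch on a hypothetical column (not from Bank.csv), using the same |z| > 3 cutoff as above:

```python
import numpy as np
from scipy.stats import zscore

# hypothetical column: 20 typical values plus one extreme outlier
col = np.array([10.0] * 20 + [100.0])

z = zscore(col)                 # (x - mean) / population std, per element
outliers = col[np.abs(z) > 3]   # same |z| > 3 cutoff as the loop above
print(len(outliers))
```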
plt.figure(figsize= (30,70))
plt.subplot(13,2,1)
sns.boxplot(x= bp.ID, color='lightblue')
plt.xlabel('ID')
plt.subplot(13,2,2)
sns.boxplot(x= bp.Age, color='blue')
plt.xlabel('AGE')
plt.subplot(13,2,3)
sns.boxplot(x= bp.Income, color='green')
plt.xlabel('INCOME')
plt.subplot(13,2,4)
sns.boxplot(x= bp['ZIP Code'], color='purple')
plt.xlabel('ZIP CODE')
plt.subplot(13,2,5)
sns.boxplot(x= bp.Family, color='yellow')
plt.xlabel('FAMILY')
plt.subplot(13,2,6)
sns.boxplot(x= bp.CCAvg, color='lightblue')
plt.xlabel('CCAVG')
plt.subplot(13,2,7)
sns.boxplot(x= bp.Education, color='lightblue')
plt.xlabel('EDUCATION')
plt.subplot(13,2,8)
sns.boxplot(x= bp.Mortgage, color='green')
plt.xlabel('Mortgage')
plt.subplot(13,2,9)
sns.boxplot(x= bp['Personal Loan'], color='blue')
plt.xlabel('PERSONAL LOAN')
plt.subplot(13,2,10)
sns.boxplot(x= bp['Securities Account'], color='yellow')
plt.xlabel('SECURITIES ACCOUNT')
plt.subplot(13,2,11)
sns.boxplot(x= bp['CD Account'], color='purple')
plt.xlabel('CD ACCOUNT')
plt.subplot(13,2,12)
sns.boxplot(x= bp.Online, color='blue')
plt.xlabel('ONLINE')
plt.subplot(13,2,13)
sns.boxplot(x= bp['CreditCard'], color='green')
plt.xlabel('CREDIT CARD')
plt.show()
for x in bp:
    series = bp[x]
    skewness = series.skew()
    if -0.25 < skewness < 0.25:
        print(x, "is symmetrically distributed as skewness =", round(skewness, 3), '\n')
    elif skewness > 0.25:
        print(x, "is positively skewed (longer right tail of the asymmetric distribution) as skewness =", round(skewness, 3), '\n')
    else:
        print(x, "is negatively skewed (longer left tail of the asymmetric distribution) as skewness =", round(skewness, 3), '\n')
- A skewness value near 0 in the output denotes a symmetrical distribution.
- A negative skewness value in the output indicates an asymmetrical distribution with a longer tail on the left-hand side.
- A positive skewness value in the output indicates an asymmetrical distribution with a longer tail on the right-hand side.
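A quick check of this sign convention on two hypothetical series (not from the bank data): one symmetric, one with a single long right tail:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])               # mirror-image values -> skew 0
right_tailed = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])  # one long right tail -> skew > 0

print(round(symmetric.skew(), 3), round(right_tailed.skew(), 3))
```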
The factors that genuinely affect Personal Loan are to be studied from the tables below.
a = bp.corr()
a
A correlation of +1 indicates a perfect positive correlation.
A correlation of -1 indicates a perfect negative correlation.
A correlation of 0 indicates no linear relationship between the variables.
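A toy illustration of those three cases on hypothetical data (not from Bank.csv):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],     # b = 2a             -> correlation +1
    "c": [10, 8, 6, 4, 2],     # c falls as a rises -> correlation -1
    "d": [1, -1, 1, -1, 1],    # no linear relation -> correlation 0
})

corr = df.corr()
print(corr.loc["a", ["b", "c", "d"]].round(3).tolist())
```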
Getting Highly Correlated Values
def get_redundant_pairs(bp):
    '''Get diagonal and lower-triangular pairs of the correlation matrix'''
    pairs_to_drop = set()
    cols = bp.columns
    for i in range(0, bp.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(bp, n=10):
    # take absolute values so strong negative correlations also rank highly
    au_corr = bp.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(bp)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
print("Top Correlations are as follows\n")
print(get_top_abs_correlations(bp, 10))
From this table the following general conclusions can be made:
There is a strong correlation between Age and Experience of 0.9 (approx), which means that as the age of a person grows, experience also increases.
# To get a correlation matrix
# Plotting the correlation plot
corr = bp.corr()
plt.figure(figsize=(15, 5))
# plotting the heat map
# corr: give the correlation matrix
# cmap: colour code used for plotting
# vmax: gives maximum range of values for the chart
# vmin: gives minimum range of values for the chart
sns.heatmap(corr, cmap='YlGnBu', vmax=1.0, vmin=-1.0)
# specify name of the plot
plt.title('Correlation between features')
plt.show()
As clear conclusions cannot yet be drawn regarding the features that genuinely affect Personal Loan, we need to dive in further.
sns.pairplot(bp,diag_kws=dict(fill=False))
print('Distribution of age column')
sns.distplot(bp['Age'], kde=True)
The Age feature is normally distributed, with the majority of customers falling between 35 and 55 years of age. We can infer this from the graph above, and the summary obtained from describe() also shows that the mean is almost equal to the median.
print('Distribution of Experience column')
sns.distplot(bp['Experience'], kde=True)
Experience is normally distributed, with most customers having experience ranging from 11 to 30 years. Here also the mean is equal to the median.
print('Distribution of Income column')
sns.distplot(bp['Income'], kde=True)
Income is positively skewed. The majority of customers have an income between 45K and 55K. We can confirm this by noting that the mean is greater than the median.
print('Distribution of CCAvg column')
sns.distplot(bp['CCAvg'], kde=True)
CCAvg is also positively skewed; average spending lies between 0K and 10K, and the majority spend less than 2.5K.
print('Distribution of Mortgage column')
sns.distplot(bp['Mortgage'], kde=True)
Mortgage: 70% of the individuals have a mortgage of less than 40K; however, the maximum value is 635K.
plt.figure(figsize=(30,45))
plt.subplot(6,2,1)
bp['Family'].value_counts().plot(kind="bar", align='center',color = 'blue',edgecolor = 'black')
plt.xlabel("Number of Family Members")
plt.ylabel("Count")
plt.title("Family Members Distribution")
plt.subplot(6,2,2)
bp['Education'].value_counts().plot(kind="bar", align='center',color = 'pink',edgecolor = 'black')
plt.xlabel('Level of Education')
plt.ylabel('Count ')
plt.title('Education Distribution')
plt.subplot(6,2,3)
bp['Securities Account'].value_counts().plot(kind="bar", align='center',color = 'violet',edgecolor = 'black')
plt.xlabel('Holding Securities Account')
plt.ylabel('Count')
plt.title('Securities Account Distribution')
plt.subplot(6,2,4)
bp['CD Account'].value_counts().plot(kind="bar", align='center',color = 'yellow',edgecolor = 'black')
plt.xlabel('Holding CD Account')
plt.ylabel('Count')
plt.title("CD Account Distribution")
plt.subplot(6,2,5)
bp['Online'].value_counts().plot(kind="bar", align='center',color = 'green',edgecolor = 'black')
plt.xlabel('Accessing Online Banking Facilities')
plt.ylabel('Count')
plt.title("Online Banking Distribution")
plt.subplot(6,2,6)
bp['CreditCard'].value_counts().plot(kind="bar", align='center',color = 'lightblue',edgecolor = 'black')
plt.xlabel('Holding Credit Card')
plt.ylabel('Count')
plt.title("Credit Card Distribution")
Family: It has 4 peaks (4 values); families with the fewest members are the most common in the sample.
Education: Mean and median are almost equal. The data is finely distributed; a few peaks show the dominance of different values.
Securities Account: This attribute tells us that most customers do not have a securities account.
CD Account: Most of the customers do not have CD accounts.
Online: A higher number of customers in the sample use online banking.
Credit Card: Fewer customers use a credit card than those who do not.
The variables Family and Education are ordinal. The distribution of family sizes is fairly even.
It seems that much of the population does not hold a Securities Account or a CD Account; a vast difference is visible.
print('As there is a strong relation between Age and Experience of 0.96 (approx),\nthe relation must also be checked against Personal Loan')
sns.scatterplot(x="Age",y="Experience",data=bp,hue='Personal Loan')
Loan is offered to customers having an age of more than 45 years and experience of more than 10 years.
print('As there is a strong relation between Income and CCAvg of 0.65 (approx),\nthe relation must also be checked against Personal Loan')
sns.scatterplot(y="Income",x="CCAvg",data=bp,hue='Personal Loan')
print('As there is a relation between Income and Mortgage of 0.20 (approx),\nthe relation must also be checked against Personal Loan')
sns.scatterplot(x="Income",y="Mortgage",data=bp,hue='Personal Loan')
sns.boxplot(y="Income",x="Family",data=bp,hue='Personal Loan')
Loan is granted only to those persons who have an income of more than 125K (approx), irrespective of their number of family members.
# Set up a grid of plots
fig = plt.figure(figsize=(20,10))
fig_dims = (1, 2)
print('Distribution of Personal Loan column')
plt.subplot2grid(fig_dims, (0,0))
bp['Personal Loan'].value_counts().plot(kind='bar', title='Personal Loan',edgecolor="0")
count_no_sub = len(bp[bp['Personal Loan']==0])
print('Total Count of Loan not offered :',count_no_sub)
count_sub = len(bp[bp['Personal Loan']==1])
print('Total Count of Loan offered:',count_sub)
pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)
print("Percentage of Loan not offered", pct_of_no_sub*100)
pct_of_sub = count_sub/(count_no_sub+count_sub)
print("Percentage of Loan offered", pct_of_sub*100)
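Because the two classes are this imbalanced, a stratified split keeps the same class ratio in both the train and test sets. A sketch with hypothetical toy labels (not the real target); `stratify=y` is the relevant argument to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 90 zeros, 10 ones
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# the 90/10 ratio is preserved in both halves
print(y_tr.mean(), y_te.mean())
```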
bp.groupby(bp['Personal Loan']).mean()
Observations:
1). The average income of customers who were offered the loan is more than double the average income of customers who were not offered the loan.
2). The average customer spending on credit cards per month ($000) is also more than double for the customers who were offered the loan.
3). The average mortgage of loan-availing customers is approximately double that of non-availing customers.
4). The average education level is lower for customers who did not take the loan.
# Looking at how each attribute correlates with Personal Loan
v = bp.corr()
v['Personal Loan']
The ID and ZIP Code attributes can be dropped.
From this table the following general conclusions can be made:
As there is a strong correlation between Age and Experience of 0.9 (approx), meaning that as the age of the customer grows experience also increases, either one of the two features can be eliminated when predicting Personal Loan.
print('As there is a relation between Personal Loan and Family of 0.06 (approx),\nthe feature (Family) must be taken into account for the prediction model')
sns.boxplot(x='Personal Loan',y="Family",data=bp,)
Customers with more than 2 family members are those who were mostly offered the Personal Loan.
print('As there is a relation between Personal Loan and Education of 0.13 (approx),\nthe feature (Education) must be taken into account for the prediction model')
sns.boxplot(x="Personal Loan",y='Education',data=bp)
Loan is offered mostly to those customers who have completed Graduate and Advanced/Professional studies.
print('As there is a relation between Personal Loan and Mortgage of 0.14 (approx),\nthe feature (Mortgage) must be taken into account for the prediction model')
sns.boxplot(x="Personal Loan",y='Mortgage',data=bp)
There are many customers with a high mortgage value who were not offered the Personal Loan, but a few customers with a low mortgage value were offered it.
print('As there is a relation between Personal Loan and CCAvg of 0.37 (approx),\nthe feature (CCAvg) must be taken into account for the prediction model')
sns.boxplot(x="Personal Loan",y="CCAvg",data=bp)
There are many customers who spend on credit cards and were not offered the Personal Loan, but a few customers who spend on credit cards were offered it.
print('As there is a relation between Personal Loan and Income of 0.50 (approx),\nthe feature (Income) must be taken into account for the prediction model')
sns.boxplot(x="Personal Loan",y="Income",data=bp)
There are customers earning more than 170K who were not offered the Personal Loan, but customers earning 125K to 170K (approx) were offered it.
print('As there is a relation between Personal Loan and Securities Account of 0.02 (approx),\nthe feature (Securities Account) must be taken into account for the prediction model')
sns.countplot(x="Securities Account",hue="Personal Loan",data=bp)
Many customers having a Securities Account were not offered the loan, whereas only a few customers with a Securities Account were offered it.
print('As there is a relation between Personal Loan and Online of 0.006 (approx),\nthe feature (Online) must be taken into account for the prediction model')
sns.countplot(x="Online",hue="Personal Loan",data=bp)
#Dependent Variable is Personal Loan
y=bp["Personal Loan"]
bpx=bp.drop(['ID','Personal Loan','Experience', 'ZIP Code'],axis=1)
x=bpx
bpx.head(5)
accuracies={}
# Standardizing and normalizing the data (computed for reference; the split below uses the raw features)
standardized_X = preprocessing.scale(x)
normalized_X = preprocessing.normalize(x)
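Note that `scale` and `normalize` do different things: `scale` gives each column zero mean and unit variance, while `normalize` rescales each row to unit L2 norm. A small contrast on a hypothetical 2x2 matrix:

```python
import numpy as np
from sklearn import preprocessing

A = np.array([[1.0, 100.0],
              [3.0, 300.0]])

std = preprocessing.scale(A)        # per-column z-scores (zero mean, unit variance)
nrm = preprocessing.normalize(A)    # per-row unit L2 norm

print(std.mean(axis=0))             # each column centred at 0
print(np.linalg.norm(nrm, axis=1))  # each row has norm 1
```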
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
# Import logistic regression and train on the training set
model = LogisticRegression()
model.fit(x_train, y_train)
# Predict target column of test data
y_pred = model.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracies['LR'] = accuracy
print("Accuracy rate of", accuracy, "which is considered a good accuracy")
print('Classification report (precision, recall, F-measure, support):\n', classification_report(y_test, y_pred))
print("The precision of the model predicting the likelihood that a liability customer will buy a personal loan is:", metrics.precision_score(y_test, y_pred))
metrics.confusion_matrix(y_test, y_pred)
Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
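To make the layout concrete: in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. A tiny hypothetical example (labels invented for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# row = true class, column = predicted class:
# cm[0][0] true negatives, cm[0][1] false positives
# cm[1][0] false negatives, cm[1][1] true positives
print(cm)
```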
# Display confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
df_conf_mat = pd.DataFrame(conf_mat)
plt.figure(figsize = (5,3))
sns.heatmap(df_conf_mat, annot=True,cmap='Blues', fmt='g')
# convert the features into z scores as we do not know what units / scales were used and store them in new dataframe
# It is always advised to scale numeric attributes in models that calculate distances.
xScaled = x.apply(zscore) # convert all attributes to Z scale
# Creating odd list of K for KNN
myList = list(range(1,20))
# Subsetting just the odd ones
neighbors = list(filter(lambda x: x % 2 != 0, myList))
Choosing the right k is not easy and is somewhat subjective.
It is usually chosen as an odd number to avoid ties.
A small k captures too much training noise and hence does not do well on test data. A very large k does so much smoothing that it fails to capture enough information from the training data, and hence also does not do well on test data.
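One common way to make the choice of k less subjective is cross-validation over the candidate odd values. A sketch on scikit-learn's built-in iris dataset (standing in for Bank.csv, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# mean 5-fold CV accuracy for each odd k, mirroring the odd-k list used above
cv_means = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print(best_k, round(cv_means[best_k], 3))
```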
# Empty list that will hold accuracy scores
ac_scores = []
# Perform accuracy metrics for values from 1,3,5....19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    # Predict the response
    y_Pred = knn.predict(x_test)
    # Evaluate accuracy
    scores = accuracy_score(y_test, y_Pred)
    ac_scores.append(scores)
# Changing to misclassification error
MSE = [1 - x for x in ac_scores]
# Determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
# Build kNN Model
NNH = KNeighborsClassifier(n_neighbors= optimal_k , weights = 'distance' )
# Call Nearest Neighbour algorithm
NNH.fit(x_train, y_train)
# Predict target column of test data
predicted_labels = NNH.predict(x_test)
y_pred = predicted_labels  # same predictions, reused below
accuracy = metrics.accuracy_score(y_test,predicted_labels)
accuracies['KNN'] = accuracy
print("Accuracy rate of", accuracy, "which is considered a good accuracy")
print("The precision of the model predicting the likelihood that a liability customer will buy a personal loan is:", metrics.precision_score(y_test, y_pred))
print('Classification report (precision, recall, F-measure, support):\n', classification_report(y_test, y_pred))
confusion_matrix(y_test, y_pred)
Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
# calculate accuracy measures and confusion matrix
print("Confusion Matrix")
cm=metrics.confusion_matrix(y_test, predicted_labels, labels=[1, 0])
df_cm = pd.DataFrame(cm, index=["Actual 1", "Actual 0"],
                     columns=["Predict 1", "Predict 0"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, cmap='YlGnBu', fmt='g')  # no vmin/vmax: the matrix holds counts, not correlations
scores =[]
for k in range(1, 50):
    NNH = KNeighborsClassifier(n_neighbors=k, weights='distance')
    NNH.fit(x_train, y_train)
    scores.append(NNH.score(x_test, y_test))
plt.plot(range(1, 50), scores)
We will plot the mean error of the predicted values on the test set for all K values between 1 and 49.
error = []
# Calculating error for K values between 1 and 49
for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error.append(np.mean(pred_i != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 50), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(x_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracies['NB'] = accuracy
print("Accuracy:",accuracy)
print('Classification report (precision, recall, F-measure, support):\n', classification_report(y_test, y_pred))
print("The precision of the model predicting the likelihood that a liability customer will buy a personal loan is:", metrics.precision_score(y_test, y_pred))
confusion_matrix(y_test, y_pred)
Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, square=True, annot=True, cmap='YlGnBu', fmt='g')  # no vmin/vmax: the matrix holds counts, not correlations
plt.xlabel('predicted label')  # columns of the confusion matrix are predictions
plt.ylabel('true label')       # rows are the true classes
plt.figure(figsize = (8,5))
plt.yticks(np.arange(0, 1.1, 0.1))  # accuracies are fractions in [0, 1]
sns.barplot(x = list(accuracies.keys()), y = list(accuracies.values()))
y_cm_lr = model.predict(x_test)
y_cm_nb = gnb.predict(x_test)
y_cm_knn = NNH.predict(x_test)  # use the tuned kNN model rather than the last loop's knn
cm_lr = confusion_matrix(y_test, y_cm_lr)
cm_knn = confusion_matrix(y_test, y_cm_knn)
cm_nb = confusion_matrix(y_test, y_cm_nb)
plt.figure(figsize = (8,8))
plt.subplots_adjust(wspace = 0.4, hspace = 0.4)
plt.subplot(2,2,1)
plt.title("LR Confusion Matrix")
sns.heatmap(cm_lr,annot=True,fmt="d",cbar=False, annot_kws={"size": 12})
plt.subplot(2,2,2)
plt.title("KNN Confusion Matrix")
sns.heatmap(cm_knn, annot = True, fmt = 'd', cbar = False, annot_kws = {"size": 12})
plt.subplot(2,2,3)
plt.title("NB Confusion Matrix")
sns.heatmap(cm_nb, annot = True, fmt = 'd', cbar = False, annot_kws = {"size": 12})
accuracies
The Logistic Regression model has the best accuracy; the accuracy on the train and test sets is almost similar, and the precision and recall are also good. Its confusion matrix is also better in comparison to the other models.
The KNN model has lower accuracy on the train and test sets, as it is distance-based, which is not ideal for this situation where the requirement is to classify the target. Its confusion matrix tells us that the number of correct predictions is not very acceptable.
Naive Bayes gives lower accuracy in comparison to the other models, meaning the probability of determining the target correctly is lower.